Overview of the workflow

The given FASTQ files (AH_S1 and CH_S2) were analyzed by a standard analytical pipeline to map the reads to a reference, hg19. The datasets were firstly compared based on the results of FastQC following skewer. The GATK Picard resulted in aligned BAM files by sequencing alignment and several post-processing including MarkDuplicates. Lastly, the variations were called by Mutect2.

Figure 1. Schematic of AH_S1 and CH_S2 analysis workflow


Sequencing read quality measured by FastQC

The quality of the two datasets were measured by FastQC after trimming adapter sequences from raw FASTQ files. Two files per each dataset were processed and the outputs were summarized in Figure 2. Among 12 QC categories, AH_S1 fails in 4 categories with 1 warning and CH_S2 fails in 2 categories with 3 warnings.

\label{fig:figs}Figure 2. Summary of FastQC results

Figure 2. Summary of FastQC results


Comparison of aligned sequences (BAM)

Targeted sequencing data

The output BAM files were analyzed by Picard CollectTargetedPcrMetrics to calculate the PCR based metrics for targeted regions (Table 1)

Table 1. Collected Targeted PCR metrics of AH_S1 and CH_S2
AH_S1 CH_S2
CUSTOM_AMPLICON_SET AH_S1_target CH_S2_target
GENOME_SIZE 3101804739 3101804739
AMPLICON_TERRITORY 23498 21282
TARGET_TERRITORY 23498 21282
TOTAL_READS 2000000 2000000
PF_READS 2000000 2000000
PF_BASES 300000000 292610263
PF_UNIQUE_READS 20892 1155004
PCT_PF_READS 1 1
PCT_PF_UQ_READS 0.010446 0.577502
PF_UQ_READS_ALIGNED 14403 1150635
PF_SELECTED_PAIRS 948114 772114
PF_SELECTED_UNIQUE_PAIRS 4029 403052
PCT_PF_UQ_READS_ALIGNED 0.689403 0.996217
PF_BASES_ALIGNED 283454480 285805025
PF_UQ_BASES_ALIGNED 1812176 163630869
ON_AMPLICON_BASES 195086105 85441202
NEAR_AMPLICON_BASES 75600787 137948372
OFF_AMPLICON_BASES 12767588 62415451
ON_TARGET_BASES 719566 42691658
ON_TARGET_FROM_PAIR_BASES 713570 42639040
PCT_AMPLIFIED_BASES 0.954957 0.781615
PCT_OFF_AMPLICON 0.045043 0.218385
ON_AMPLICON_VS_SELECTED 0.720708 0.382476
MEAN_AMPLICON_COVERAGE 8302.243 4014.717
MEAN_TARGET_COVERAGE 30.62244 2005.998
MEDIAN_TARGET_COVERAGE 25 200
FOLD_ENRICHMENT 90850.34 43571.2
ZERO_CVG_TARGETS_PCT 0 0
PCT_EXC_DUPE 0.993607 0.427474
PCT_EXC_MAPQ 0.000362 0.011102
PCT_EXC_BASEQ 0 7.2e-05
PCT_EXC_OVERLAP 0 0
PCT_EXC_OFF_TARGET 0.006004 0.5595
FOLD_80_BASE_PENALTY 1.80132 10.02999
PCT_TARGET_BASES_1X 0.999957 1
PCT_TARGET_BASES_2X 0.999957 1
PCT_TARGET_BASES_10X 0.98336 1
PCT_TARGET_BASES_20X 0.695463 1
PCT_TARGET_BASES_30X 0.354328 1
AT_DROPOUT 12.71635 4.323885
GC_DROPOUT 1.916709 0.544665
HET_SNP_SENSITIVITY 0.995386 1
HET_SNP_Q 23 -1

The read depth in commonly targeted regions

Both AH_S1 and CH_S2 targeted regions were calculated by bedtools and used to measure the read depth of those regions in the aligned BAM files by applying htseq. The distribution of the read depth is shown in Figure 3 and the read depth of each position is plotted in Figure 4 which prove consitant read depth of CH_S2.
Figure 3. Histogram of depths of AH_S1 and CH_S2

Figure 3. Histogram of depths of AH_S1 and CH_S2

Figure 4. Sequencing depth of common regions of AH_S1 and CH_S2

Figure 4. Sequencing depth of common regions of AH_S1 and CH_S2


Variant calling

Those aligned BAM files were used for variant calling compared with hg19 reference as normal tissue sequecing dataset was not given. Mutect2 identified 101 and 219 variation locations for AH_S1 and CH_S2 respectively and 25 of them is common (Figure 5). The common variant calls were annotated by VEP and Table 2 highlights nonsynonymous mutations involving mutational hotspots or Cancer Hotspots.

\label{fig:figs}Figure 5. Overlaps of variant calls between AH_S1 and CH_S2 samples

Figure 5. Overlaps of variant calls between AH_S1 and CH_S2 samples

Table 2. Nonsynonymous mutation calls among commonly detected in AH_S1 and CH_S2
Location Allele Consequence IMPACT SYMBOL Protein_position Amino_acids
1:115256530-115256530 T missense_variant MODERATE NRAS 61 Q/K
12:25398281-25398281 T missense_variant MODERATE KRAS 13 G/D
15:66727451-66727451 C missense_variant MODERATE MAP2K1 56 Q/P
15:66729147-66729147 T missense_variant MODERATE MAP2K1 119 H/Y
15:90631917-90631918
frameshift_variant HIGH IDH2 145 G/X
15:90631934-90631934 T missense_variant MODERATE IDH2 140 R/Q
17:7578388-7578388 T frameshift_variant HIGH TP53 181 R/QX
17:7579472-7579472 C missense_variant MODERATE TP53 72 P/R
3:178936082-178936082 A missense_variant MODERATE PIK3CA 542 E/K
3:178952085-178952085 G missense_variant MODERATE PIK3CA 1047 H/R
3:41266101-41266101 A missense_variant MODERATE CTNNB1 33 S/Y
3:41266133-41266136
inframe_deletion MODERATE CTNNB1 44-45 PS/P
7:140453136-140453136 C missense_variant MODERATE BRAF 600 V/G
7:55241707-55241707 A missense_variant MODERATE EGFR 719 G/S
9:80409488-80409488 A missense_variant MODERATE GNAQ 209 Q/L

Conclusion

This report shows the workflow to analyze two different NGS library sequencing results. Overall, CH_S2 could be a better choice than AH_S1 considering the sequecing quality, consistency of read depth and also extent of variation calls.

Session Information

The Rmd file of this report and source codes are available in my repository github.

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] grid      parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2  VennDiagram_1.6.20  futile.logger_1.4.3
##  [4] vcfR_1.12.0         kableExtra_1.3.1    ngsReports_1.2.0   
##  [7] tibble_3.0.4        ggplot2_3.3.2       BiocGenerics_0.32.0
## [10] BiocStyle_2.14.4   
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-143                bitops_1.0-6               
##  [3] matrixStats_0.57.0          lubridate_1.7.9.2          
##  [5] webshot_0.5.2               httr_1.4.2                 
##  [7] GenomeInfoDb_1.22.1         tools_3.6.1                
##  [9] vegan_2.5-6                 R6_2.5.0                   
## [11] mgcv_1.8-31                 lazyeval_0.2.2             
## [13] colorspace_2.0-0            permute_0.9-5              
## [15] withr_2.3.0                 tidyselect_1.1.0           
## [17] compiler_3.6.1              rvest_0.3.6                
## [19] Biobase_2.46.0              formatR_1.7                
## [21] flashClust_1.01-2           xml2_1.3.2                 
## [23] DelayedArray_0.12.3         plotly_4.9.2.1             
## [25] ggdendro_0.1.22             scales_1.1.1               
## [27] stringr_1.4.0               digest_0.6.27              
## [29] Rsamtools_2.2.3             rmarkdown_2.5              
## [31] XVector_0.26.0              jpeg_0.1-8.1               
## [33] pkgconfig_2.0.3             htmltools_0.5.0            
## [35] FactoMineR_2.3              highr_0.8                  
## [37] htmlwidgets_1.5.2           rlang_0.4.8                
## [39] rstudioapi_0.13             visNetwork_2.0.9           
## [41] generics_0.1.0              farver_2.0.3               
## [43] zoo_1.8-8                   hwriter_1.3.2              
## [45] jsonlite_1.7.1              BiocParallel_1.20.1        
## [47] dplyr_1.0.2                 RCurl_1.98-1.2             
## [49] magrittr_2.0.1              GenomeInfoDbData_1.2.2     
## [51] leaps_3.1                   Matrix_1.2-18              
## [53] Rcpp_1.0.5                  munsell_0.5.0              
## [55] S4Vectors_0.24.4            ape_5.4-1                  
## [57] lifecycle_0.2.0             scatterplot3d_0.3-41       
## [59] stringi_1.4.6               yaml_2.2.1                 
## [61] MASS_7.3-51.5               SummarizedExperiment_1.16.1
## [63] zlibbioc_1.32.0             plyr_1.8.6                 
## [65] pinfsc50_1.2.0              ggrepel_0.8.2              
## [67] crayon_1.3.4                lattice_0.20-38            
## [69] splines_3.6.1               Biostrings_2.54.0          
## [71] pander_0.6.3                knitr_1.30                 
## [73] pillar_1.4.7                GenomicRanges_1.38.0       
## [75] reshape2_1.4.4              stats4_3.6.1               
## [77] futile.options_1.0.1        glue_1.4.2                 
## [79] evaluate_0.14               ShortRead_1.44.3           
## [81] latticeExtra_0.6-29         memuse_4.1-0               
## [83] lambda.r_1.2.4              data.table_1.13.2          
## [85] BiocManager_1.30.10         vctrs_0.3.4                
## [87] png_0.1-7                   gtable_0.3.0               
## [89] purrr_0.3.4                 tidyr_1.1.2                
## [91] xfun_0.19                   viridisLite_0.3.0          
## [93] truncnorm_1.0-8             GenomicAlignments_1.22.1   
## [95] IRanges_2.20.2              cluster_2.1.0              
## [97] DiagrammeR_1.0.6.1          ellipsis_0.3.1